In [3]:
from __future__ import division  # __future__ imports belong at the top

import glob
import os

import numpy as np
import pandas as pd
import sqlalchemy as sqla
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model

Intro: Data and goals

Data comes from two sources:

  • TLC data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (or, for the newer data only, https://data.cityofnewyork.us/data?agency=Taxi+and+Limousine+Commission+%28TLC%29&cat=&type=new_view&browseSearch=&scope= )
    • This data includes GPS coordinates for every single yellow cab trip in NYC, as well as fare data, the taxi medallion, and a variety of other info.
  • Census/American Community Survey (ACS) data:
    • Income, commute data, and much, much more.
    • The decennial census attempts to record every single US resident, but is mostly just a population count, with only a small amount of other info.
    • The ACS uses a random sample and collects diverse demographic information, though some of it comes with very large margins of error.
    • There are lots of sources; I've used https://nhgis.org/, since it lets you download just the columns you want rather than wasting hard drive space and time processing data you don't need.
    • The census supplies "shapefiles" of the different geographical areas it uses at https://www.census.gov/geo/maps-data/data/tiger-line.html. I divided my data up by census tract, so I got census tract shapefiles.
    • Unfortunately, for some fields there is simply too much error for them to be useful. The ACS contains info on time spent commuting, but for car commuters in NYC the error had a median of around 50%, which made it unusable.
    • It will be interesting to see what effect race has on taxi dropoffs - does the conventional wisdom about taxis being prejudiced show up in the dropoff data?
  • I have also taken a look at some Google Maps distance data, but it wasn't very predictive on its own. Maybe combining it with some of the demographic data will make it a bit more useful.

My goal will be to predict the number of taxi dropoffs in an area based on demographic and other statistics about that area. Ideally, I would like to understand the demographics of taxicab users and how well taxicabs serve NYC residents. I may later expand my data to look at green "boro cabs" (which are only allowed to pick up passengers outside of the Manhattan central business district; data also available from the TLC) and Uber cars (https://github.com/fivethirtyeight/uber-tlc-foil-response) to compare their usage across demographics.

If I do something with the Uber data, I plan to write up a story in the style of fivethirtyeight.com that I may submit to them, as they have done a series of stories on this Uber data.

I would also expect that a paper comparing taxi services would be useful for city government, which wants to ensure that there is good coverage throughout the city and across demographic groups.

Analysis

I'm only using census tracts where the population is greater than 1000. Most of the tracts with smaller populations are parks or ocean, the errors associated with their data are generally very high, and taxi dropoffs there are probably not representative of the resident population anyway.


In [4]:
engine = sqla.create_engine('postgresql://mydatabase:notyours@localhost:5432/TaxiData',echo=False)
full=pd.read_sql_query('SELECT a.* FROM lotsofdata AS a WHERE a.totalpopulation>=1000',engine).set_index('fipscodes')

For this analysis, I'll be looking at data from three columns of the ACS, with per-capita dropoffs as my dependent variable, i.e. the dropoffs computed above divided by the population given by the census.

This data uses the 2010 census (the most recent), and 5-year data for the American Community Survey, 2007-2011. I figure the ACS data splits the difference between my 2010 population data and the 2013 drop-off data.

The census uses some funky abbreviations for their column headers, and I haven't changed them. Here's a description:

  • MOGE001 : Workers 16 years and over who did not work at home.
  • MOGE011 : Workers 16 years and over who did not work at home and commuted by car, truck, or van.
  • MRUE001 : Per Capita Income in the Past 12 Months (in 2011 Inflation-Adjusted Dollars)

I compute the ratio of non-driving commuters to total commuters, (MOGE001 - MOGE011) / MOGE001, and label it nondrivercommuterrat.
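
For reference, the per-capita dropoff column itself was built in an earlier step; a minimal sketch of that computation is below. The raw count column name rawdropoffs2013 is a hypothetical placeholder for illustration; only totalpopulation appears in the actual query above.

# Hypothetical sketch of the earlier per-capita computation.
# 'rawdropoffs2013' is an assumed column name for the raw 2013 yellow-cab
# dropoff counts per tract; 'totalpopulation' is the 2010 census population.
full['abridged2013ycdrpoffpc'] = full['rawdropoffs2013'] / full['totalpopulation']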


In [5]:
subset=full[['MOGE001','MOGE011','MRUE001','abridged2013ycdrpoffpc']].copy()  # .copy() so the column assignment below doesn't hit a view of `full`
subset['nondrivercommuterrat']=(subset['MOGE001']-subset['MOGE011'])/subset['MOGE001']
subset.replace(np.inf,np.nan,inplace=True)
subset.dropna(inplace=True)
subset.head()


Out[5]:
MOGE001 MOGE011 MRUE001 abridged2013ycdrpoffpc nondrivercommuterrat
fipscodes
36081150702 1337 1003 39608 0.043738 0.249813
36085031902 1567 887 21271 0.004538 0.433950
36047105804 1946 698 16944 0.028478 0.641316
36061000201 986 180 11770 0.828973 0.817444
36085013400 1711 1196 35314 0.007096 0.300994

In [7]:
# Plotting the data:
comd=0
buff=100

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])
plt.autoscale(tight=True)
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
a=plt.scatter(subset['MRUE001'],subset['abridged2013ycdrpoffpc'],alpha=0.5,label='Data',c=subset['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
#plt.legend(loc='upper left')
plt.show()


There appears to be a bit of a linear relationship between the logs of dropoffs per-capita and per-capita income, but it's not very strong. There is an odd region that almost looks like it has a negative slope, but the coloring shows that this region tends to have a lower non-driver commuter ratio.


In [28]:
comd=0.8
asub=subset[(subset['nondrivercommuterrat']>comd)]
bsub=subset[(subset['nondrivercommuterrat']<comd)]
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])

#plt.xlim(xdata.min()-buff,xdata.max()+buff)
plt.xlabel('Income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
#plt.ylim(-0.2,7)
plt.title("Yellow cab dropoffs vs income of census tract")
#plt.title("Trip duration matters")
#plt.xlim(0,0.002)
plt.autoscale(enable=True, axis='both', tight=True)
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')

a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('cool'))


#plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2, label='Fit, $\mathsf{R^2}=$ ' + str(round(fit.rsquared,2)))
#plt.scatter(xrdata,yrdata,alpha=1,label='Data (TLC & google maps)',c=subset[subset['nondrivercommuterpc']<=0.5]['nondrivercommuterpc'],cmap=plt.cm.get_cmap('autumn'))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
#plt.scatter(xrdata,yrdata,alpha=1,label='Data (TLC & google maps)',color='red')
#plt.colorbar(a)
plt.yscale('log')
plt.xscale('log')

#fitlabel='Fit: '+str(round(res.params['esb-inv'],2))+ '/x - '+ str(abs(round(res.params['const'],2)))
#plt.plot(XX,res.params['esb-inv']/XX+res.params['esb-centroidduration']*XX+res.params['const'],alpha=0.9,color='red',linewidth=2,label=fitlabel)
plt.legend(loc='upper left')
#plt.savefig('2013dropoffs_vs_esb-duration.svg')
#plt.ylim(-0.2,7)
plt.show()


If we look only at the data with a high non-driver commuter ratio, the linear relationship is very clear. That odd corner consists entirely of lower non-driver commuter ratios.

As for the rest of the data, the relationship isn't really obvious. At least with per-capita income as the sole explanatory variable, there doesn't seem to be much correlation (see the blob in the next plot), and I will need to explore other variables. I might want to use some more advanced machine learning techniques to figure out which data is useful.
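
As a rough sketch of what that variable-selection step might look like (not run here): fit an L1-regularized (lasso) regression on a standardized set of candidate ACS columns and keep only those whose coefficients survive the shrinkage. The candidate_cols list below is a placeholder containing just the three columns already discussed; the idea would be to drop in whatever extra ACS fields get added to full.

from sklearn import linear_model, preprocessing

# Placeholder list of candidate ACS columns; swap in whatever extra
# demographic fields end up in `full`.
candidate_cols = ['MRUE001', 'MOGE001', 'MOGE011']

sel = full[candidate_cols + ['abridged2013ycdrpoffpc']].replace(np.inf, np.nan).dropna()
sel = sel[sel['abridged2013ycdrpoffpc'] > 0]  # log target needs positive values

X = preprocessing.scale(sel[candidate_cols].values.astype(float))
y = np.log(sel['abridged2013ycdrpoffpc'].values)

# Cross-validated lasso: columns whose coefficients shrink to ~0 are
# candidates to drop from the model.
lasso = linear_model.LassoCV(cv=5).fit(X, y)
for col, coef in zip(candidate_cols, lasso.coef_):
    print(col, round(coef, 3))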


In [9]:
comd=0.8
asub=subset[(subset['nondrivercommuterrat']<comd)]
bsub=subset[(subset['nondrivercommuterrat']>comd)]
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])

#plt.xlim(xdata.min()-buff,xdata.max()+buff)
plt.xlabel('Income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
#plt.ylim(-0.2,7)
plt.title("Yellow cab dropoffs vs income of census tract")
#plt.title("Trip duration matters")
#plt.xlim(0,0.002)
plt.autoscale(enable=True, axis='both', tight=True)

a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $<$ 0.8',c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))

plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',alpha=0.1,label='Data, non-driver commuter ratio $>$ 0.8')

#plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2, label='Fit, $\mathsf{R^2}=$ ' + str(round(fit.rsquared,2)))
#plt.scatter(xrdata,yrdata,alpha=1,label='Data (TLC & google maps)',c=subset[subset['nondrivercommuterpc']<=0.5]['nondrivercommuterpc'],cmap=plt.cm.get_cmap('autumn'))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
#plt.scatter(xrdata,yrdata,alpha=1,label='Data (TLC & google maps)',color='red')
#plt.colorbar(a)
plt.yscale('log')
plt.xscale('log')

#fitlabel='Fit: '+str(round(res.params['esb-inv'],2))+ '/x - '+ str(abs(round(res.params['const'],2)))
#plt.plot(XX,res.params['esb-inv']/XX+res.params['esb-centroidduration']*XX+res.params['const'],alpha=0.9,color='red',linewidth=2,label=fitlabel)
plt.legend(loc='upper left')
#plt.savefig('2013dropoffs_vs_esb-duration.svg')
#plt.ylim(-0.2,7)
plt.show()


We can, however, do some regression on the high non-driver-commuter-ratio data:


In [31]:
asub=subset[(subset['nondrivercommuterrat']>comd)]
bsub=subset[(subset['nondrivercommuterrat']<comd)]
model=sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),sm.add_constant(np.log(asub['MRUE001']),prepend=False))
fit=model.fit()
fit.summary()


Out[31]:
OLS Regression Results
Dep. Variable: abridged2013ycdrpoffpc R-squared: 0.689
Model: OLS Adj. R-squared: 0.689
Method: Least Squares F-statistic: 1507.
Date: Sat, 02 Jan 2016 Prob (F-statistic): 1.17e-174
Time: 17:51:15 Log-Likelihood: -1086.7
No. Observations: 682 AIC: 2177.
Df Residuals: 680 BIC: 2187.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
MRUE001 2.2805 0.059 38.820 0.000 2.165 2.396
const -23.8229 0.605 -39.367 0.000 -25.011 -22.635
Omnibus: 50.788 Durbin-Watson: 1.209
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.269
Skew: 0.586 Prob(JB): 3.34e-16
Kurtosis: 4.065 Cond. No. 138.
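
Since both variables are logged, this fit is a power law: $\log(\textrm{dropoffs per capita}) \approx 2.28\,\log(\textrm{income}) - 23.8$, or equivalently $\textrm{dropoffs per capita} \approx e^{-23.8}\,\textrm{income}^{2.28}$. Roughly speaking, within this subset a doubling of per-capita income corresponds to about a $2^{2.28} \approx 4.9\times$ increase in per-capita dropoffs.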

In [30]:
comd=0.8
asub=subset[(subset['nondrivercommuterrat']>comd)]
bsub=subset[(subset['nondrivercommuterrat']<comd)]
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

buff=.8
XX=np.linspace(xdata.min(),xdata.max(),num=100)

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])


plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")

plt.autoscale(enable=True, axis='y', tight=True)


a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')

plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2, label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared,2)))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")

plt.yscale('log')
plt.xscale('log')

plt.legend(loc='upper left')
#plt.savefig('2013dropoffs_vs_esb-duration.svg')
#plt.ylim(-0.2,7)
plt.show()



In [32]:
fig=sm.qqplot(fit.resid,line='45')

fig.show()



In [135]:
hist=fit.resid.hist(alpha=0.8)


I have to admit, I'm not quite sure what to think of these plots. Clearly my data isn't exactly normally distributed, but this is the real world so that's not surprising. Does it deviate from a normal distribution enough that I should be worried? I don't know.
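
One way to put a rough number on that doubt (just a sketch, not run here) is to compute formal normality tests on the residuals directly. These are the same Omnibus and Jarque-Bera statistics that appear in the regression summary above, where the tiny p-values do indicate a statistically significant departure from normality.

from statsmodels.stats.stattools import jarque_bera, omni_normtest

# Formal normality tests on the regression residuals; small p-values
# indicate a significant departure from a normal distribution.
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(fit.resid)
omni_stat, omni_pvalue = omni_normtest(fit.resid)

print('Jarque-Bera p-value:', jb_pvalue)
print('Omnibus p-value:', omni_pvalue)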

I'm going to take a look at the outliers and high leverage points regardless:


In [14]:
from statsmodels.graphics import regressionplots

from statsmodels.stats import outliers_influence

# Influence diagnostics for the fitted OLS model
influence=outliers_influence.OLSInfluence(fit)

infframe=influence.summary_frame()

# Flag tracts whose studentized residual exceeds 3 in magnitude as outliers
outliers=infframe[np.abs(infframe['student_resid'])>3].index

In [45]:
asub.loc[outliers]


Out[45]:
MOGE001 MOGE011 MRUE001 abridged2013ycdrpoffpc nondrivercommuterrat
fipscodes
36061010100 502 38 69730 288.611111 0.924303
36061011900 361 0 23206 151.490179 1.000000
36061019600 932 77 11176 4.876876 0.917382
36061020300 1128 55 8484 5.658134 0.951241
36061024000 131 5 5517 0.947209 0.961832

In [46]:
outliers.tolist()


Out[46]:
[u'36061010100',
 u'36061011900',
 u'36061019600',
 u'36061020300',
 u'36061024000']

In [49]:
from IPython.display import Image
Image(filename='Outliers.png')


Out[49]:
[map image (Outliers.png) highlighting the five outlier census tracts]
These five outliers consist of:

  1. The tract centered on Times Square, which is a major transit hub and contains one of the largest residences for the homeless in the US.
  2. The tract centered on Penn Station, the busiest transit hub in NYC.
  3. The tract that consists only of Columbia University. I imagine the population there is mostly students in dorms, many of whom have nearly no income of their own but quite wealthy parents.
  4. The tract centered on the 125th Street Metro-North station. This one is the most surprising to me, as I didn't think it was that big of a transit hub.
  5. Finally, the tract consisting of Randall's/Ward's Island. I'm fairly certain that nearly all of its population lives in the various mental institutions and homeless shelters on the island. It also has terrible public transit access but lots of sports fields and event spaces, so I imagine plenty of event-goers take cabs there.

In [52]:
asub=asub.drop(outliers)  # reassign rather than dropping in place on a slice of `subset`
model=sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),sm.add_constant(np.log(asub['MRUE001']),prepend=False))
fit=model.fit()
fit.summary()


Out[52]:
OLS Regression Results
Dep. Variable: abridged2013ycdrpoffpc R-squared: 0.718
Model: OLS Adj. R-squared: 0.718
Method: Least Squares F-statistic: 1719.
Date: Mon, 04 Jan 2016 Prob (F-statistic): 9.76e-188
Time: 14:04:33 Log-Likelihood: -1039.9
No. Observations: 677 AIC: 2084.
Df Residuals: 675 BIC: 2093.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
MRUE001 2.3166 0.056 41.462 0.000 2.207 2.426
const -24.2279 0.576 -42.076 0.000 -25.359 -23.097
Omnibus: 9.661 Durbin-Watson: 1.185
Prob(Omnibus): 0.008 Jarque-Bera (JB): 9.286
Skew: 0.249 Prob(JB): 0.00963
Kurtosis: 2.716 Cond. No. 138.

In [53]:
# asub (now with the outliers dropped) and bsub are reused from above
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

buff=.8
XX=np.linspace(xdata.min(),xdata.max(),num=100)

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])


plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")

plt.autoscale(enable=True, axis='y', tight=True)


a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')

plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2, label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared,2)))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")

plt.yscale('log')
plt.xscale('log')

plt.legend(loc='upper left')
#plt.savefig('2013dropoffs_vs_esb-duration.svg')
#plt.ylim(-0.2,7)
plt.show()


Removing the outliers bumps up the $R^2$ and, more importantly, produces a fit line that is more representative of the bulk of the data.

Next I'll look at high leverage points:


In [63]:
# Flag tracts whose leverage (hat value) is more than three times the mean leverage
highleverage=infframe[infframe['hat_diag']>(infframe['hat_diag'].mean()*3)].index

In [77]:
Image(filename='HighL_Outliers.png')


Out[77]:
[map image (HighL_Outliers.png) highlighting the high-leverage census tracts]
These are some of the highest-income census tracts in the US (the world?), possibly the highest; I feel like I read that somewhere.

Dropping these high leverage points as well as the outliers gives us:


In [64]:
asub=asub.drop(highleverage)  # again reassign rather than dropping in place
model=sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),sm.add_constant(np.log(asub['MRUE001']),prepend=False))
fit=model.fit()
fit.summary()


Out[64]:
OLS Regression Results
Dep. Variable: abridged2013ycdrpoffpc R-squared: 0.714
Model: OLS Adj. R-squared: 0.713
Method: Least Squares F-statistic: 1670.
Date: Mon, 04 Jan 2016 Prob (F-statistic): 3.90e-184
Time: 14:28:25 Log-Likelihood: -1032.9
No. Observations: 672 AIC: 2070.
Df Residuals: 670 BIC: 2079.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
MRUE001 2.3369 0.057 40.869 0.000 2.225 2.449
const -24.4302 0.588 -41.517 0.000 -25.586 -23.275
Omnibus: 9.225 Durbin-Watson: 1.172
Prob(Omnibus): 0.010 Jarque-Bera (JB): 8.897
Skew: 0.245 Prob(JB): 0.0117
Kurtosis: 2.722 Cond. No. 141.

The $R^2$ is slightly lower, but the parameters are probably more representative of the data as a whole.


In [65]:
# asub (with outliers and high leverage points dropped) and bsub are reused from above
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

buff=.8
XX=np.linspace(xdata.min(),xdata.max(),num=100)

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])


plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")

plt.autoscale(enable=True, axis='y', tight=True)


a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')

plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2, label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared,2)))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")

plt.yscale('log')
plt.xscale('log')

plt.legend(loc='upper left')
#plt.savefig('2013dropoffs_vs_esb-duration.svg')
#plt.ylim(-0.2,7)
plt.show()


Below is all the data, so you can see the dropped outliers and high leverage points as well. Only ten of the 682 points in this subset were dropped, a very small fraction of the data.


In [76]:
# asub, bsub, outliers, and highleverage are reused from above
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

buff=.8
XX=np.linspace(xdata.min(),xdata.max(),num=100)

matplotlib.rc('legend', fontsize=10)

a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])


plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")

plt.autoscale(enable=True, axis='y', tight=True)


a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')
plt.scatter(subset.loc[outliers]['MRUE001'],subset.loc[outliers]['abridged2013ycdrpoffpc'],alpha=0.25,label='Outliers',color='red')
plt.scatter(subset.loc[highleverage]['MRUE001'],subset.loc[highleverage]['abridged2013ycdrpoffpc'],alpha=0.3,label='High leverage',color='orange')

plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2, label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared,2)))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")

plt.yscale('log')
plt.xscale('log')

plt.legend(loc='lower right')
#plt.savefig('2013dropoffs_vs_esb-duration.svg')
#plt.ylim(-0.2,7)
plt.show()


Now we can see that the residuals for the remaining data are much closer to normally distributed:


In [71]:
fig=sm.qqplot(fit.resid,line='45')

fig.show()



In [72]:
hist=fit.resid.hist(alpha=0.8)


We've got a fit that explains most of the variance, with an $R^2$ of about 0.7. I suspect I won't be able to explain much more of this subset (non-driver commuter ratio > 0.8) of the data, but maybe I will.

However, there is still lots to do. I should probably do an analysis of the error in this data, to see whether there is really any more variance that could be explained (i.e. much of the remaining variance might be due to random fluctuations). I also might want to do some analysis to better understand the causal relationship between these two variables. Many of these dropoffs are not representative of the resident population but are instead common destinations, e.g. Grand Central and Penn Stations or the area around Central Park. In those cases, the popular destination drives up rents and thus incomes, so it can't be concluded that the rich are more likely to take cabs.

Also, to really do this analysis properly, I shouldn't be fitting on all of the data and should hold some of it out; a rough sketch of that is below. I figure I can get to that once I've got a model that explains the rest of the data as well.
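
A minimal sketch of that holdout step, assuming scikit-learn's train_test_split is available (older releases expose it under sklearn.cross_validation rather than sklearn.model_selection); the 70/30 split and random seed are arbitrary.

from sklearn.model_selection import train_test_split

# Hold out a random 30% of the tracts and refit the income-only model
# on the remainder.
train, test = train_test_split(asub, test_size=0.3, random_state=0)

train_fit = sm.OLS(np.log(train['abridged2013ycdrpoffpc']),
                   sm.add_constant(np.log(train['MRUE001']), prepend=False)).fit()

# Evaluate on the held-out tracts: out-of-sample R^2 in log space.
y_test = np.log(test['abridged2013ycdrpoffpc'])
y_pred = train_fit.predict(sm.add_constant(np.log(test['MRUE001']), prepend=False))
ss_res = ((y_test - y_pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
print('held-out R^2:', 1 - ss_res / ss_tot)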


In [ ]: